{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib\n",
    "import seaborn as sns\n",
    "matplotlib.rcParams['savefig.dpi'] = 144"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from static_grader import grader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ML Miniproject\n",
    "## Introduction\n",
    "\n",
    "The objective of this miniproject is to exercise your ability to create effective machine learning models for making predictions. We will be working with credit card data from Taiwan, predicting whether customers will default based on their recent billing data as well as demographics.\n",
    "\n",
    "## Scoring\n",
    "\n",
    "In this miniproject you will submit the predictions of your model to the grader. The grader will assess the performance of your model using a scoring metric, comparing it against the score of a reference model. We will use the [average precision score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html). If your model performs better than the reference solution, then you can score higher than 1.0.\n",
    "\n",
    "## Downloading the data\n",
    "\n",
    "We can download the data set from Amazon S3:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: cannot create directory 'data': File exists\r\n"
     ]
    }
   ],
   "source": [
    "!mkdir data\n",
    "!aws s3 sync s3://dataincubator-wqu/mldata/ ./data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll load the data into a Pandas DataFrame and pop out the target labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('./data/UCI_Credit_Card_train.csv', index_col=False)\n",
    "target = data.pop('default.payment.next.month')\n",
    "\n",
    "test = pd.read_csv('./data/UCI_Credit_Card_test.csv', index_col=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LIMIT_BAL</th>\n",
       "      <th>SEX</th>\n",
       "      <th>EDUCATION</th>\n",
       "      <th>MARRIAGE</th>\n",
       "      <th>AGE</th>\n",
       "      <th>PAY_0</th>\n",
       "      <th>PAY_2</th>\n",
       "      <th>PAY_3</th>\n",
       "      <th>PAY_4</th>\n",
       "      <th>PAY_5</th>\n",
       "      <th>...</th>\n",
       "      <th>BILL_AMT3</th>\n",
       "      <th>BILL_AMT4</th>\n",
       "      <th>BILL_AMT5</th>\n",
       "      <th>BILL_AMT6</th>\n",
       "      <th>PAY_AMT1</th>\n",
       "      <th>PAY_AMT2</th>\n",
       "      <th>PAY_AMT3</th>\n",
       "      <th>PAY_AMT4</th>\n",
       "      <th>PAY_AMT5</th>\n",
       "      <th>PAY_AMT6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>50000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>10243.0</td>\n",
       "      <td>10826.0</td>\n",
       "      <td>11699.0</td>\n",
       "      <td>10146.0</td>\n",
       "      <td>3000.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>2000.0</td>\n",
       "      <td>2000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>80000.0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>43</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>40350.0</td>\n",
       "      <td>38812.0</td>\n",
       "      <td>33459.0</td>\n",
       "      <td>31650.0</td>\n",
       "      <td>1582.0</td>\n",
       "      <td>15350.0</td>\n",
       "      <td>1491.0</td>\n",
       "      <td>3459.0</td>\n",
       "      <td>1650.0</td>\n",
       "      <td>9028.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>200000.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>36</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>4579.0</td>\n",
       "      <td>5840.0</td>\n",
       "      <td>5603.0</td>\n",
       "      <td>6352.0</td>\n",
       "      <td>2500.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1500.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>280000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>50</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>6781.0</td>\n",
       "      <td>5725.0</td>\n",
       "      <td>3989.0</td>\n",
       "      <td>2599.0</td>\n",
       "      <td>11574.0</td>\n",
       "      <td>7104.0</td>\n",
       "      <td>5857.0</td>\n",
       "      <td>3989.0</td>\n",
       "      <td>2599.0</td>\n",
       "      <td>27192.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>150000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>51</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>148393.0</td>\n",
       "      <td>149709.0</td>\n",
       "      <td>107862.0</td>\n",
       "      <td>108623.0</td>\n",
       "      <td>7000.0</td>\n",
       "      <td>7600.0</td>\n",
       "      <td>6000.0</td>\n",
       "      <td>4000.0</td>\n",
       "      <td>4100.0</td>\n",
       "      <td>4300.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \\\n",
       "0    50000.0    2          2         1   34      0      0      2      0   \n",
       "1    80000.0    1          2         1   43      0      0      0      0   \n",
       "2   200000.0    1          1         1   36      0      0      2      2   \n",
       "3   280000.0    2          2         2   50     -1     -1     -1     -1   \n",
       "4   150000.0    2          2         1   51      0      0      0      0   \n",
       "\n",
       "   PAY_5    ...     BILL_AMT3  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  \\\n",
       "0      0    ...       10243.0    10826.0    11699.0    10146.0    3000.0   \n",
       "1      0    ...       40350.0    38812.0    33459.0    31650.0    1582.0   \n",
       "2      2    ...        4579.0     5840.0     5603.0     6352.0    2500.0   \n",
       "3     -1    ...        6781.0     5725.0     3989.0     2599.0   11574.0   \n",
       "4      0    ...      148393.0   149709.0   107862.0   108623.0    7000.0   \n",
       "\n",
       "   PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  \n",
       "0    1000.0    1000.0    1000.0    2000.0    2000.0  \n",
       "1   15350.0    1491.0    3459.0    1650.0    9028.0  \n",
       "2       0.0    1500.0       0.0    1000.0    1000.0  \n",
       "3    7104.0    5857.0    3989.0    2599.0   27192.0  \n",
       "4    7600.0    6000.0    4000.0    4100.0    4300.0  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    0\n",
       "2    1\n",
       "3    0\n",
       "4    0\n",
       "Name: default.payment.next.month, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LIMIT_BAL</th>\n",
       "      <th>SEX</th>\n",
       "      <th>EDUCATION</th>\n",
       "      <th>MARRIAGE</th>\n",
       "      <th>AGE</th>\n",
       "      <th>PAY_0</th>\n",
       "      <th>PAY_2</th>\n",
       "      <th>PAY_3</th>\n",
       "      <th>PAY_4</th>\n",
       "      <th>PAY_5</th>\n",
       "      <th>...</th>\n",
       "      <th>BILL_AMT3</th>\n",
       "      <th>BILL_AMT4</th>\n",
       "      <th>BILL_AMT5</th>\n",
       "      <th>BILL_AMT6</th>\n",
       "      <th>PAY_AMT1</th>\n",
       "      <th>PAY_AMT2</th>\n",
       "      <th>PAY_AMT3</th>\n",
       "      <th>PAY_AMT4</th>\n",
       "      <th>PAY_AMT5</th>\n",
       "      <th>PAY_AMT6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>44</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-2</td>\n",
       "      <td>-2</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1275.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>90000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>33</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>59753.0</td>\n",
       "      <td>61755.0</td>\n",
       "      <td>63629.0</td>\n",
       "      <td>54423.0</td>\n",
       "      <td>2100.0</td>\n",
       "      <td>4000.0</td>\n",
       "      <td>3000.0</td>\n",
       "      <td>3000.0</td>\n",
       "      <td>2323.0</td>\n",
       "      <td>2000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>140000.0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>142140.0</td>\n",
       "      <td>142111.0</td>\n",
       "      <td>194934.0</td>\n",
       "      <td>95484.0</td>\n",
       "      <td>5600.0</td>\n",
       "      <td>6150.0</td>\n",
       "      <td>5900.0</td>\n",
       "      <td>4000.0</td>\n",
       "      <td>4000.0</td>\n",
       "      <td>50000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>340000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>37</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>16308.0</td>\n",
       "      <td>21065.0</td>\n",
       "      <td>20581.0</td>\n",
       "      <td>15936.0</td>\n",
       "      <td>8641.0</td>\n",
       "      <td>16400.0</td>\n",
       "      <td>15000.0</td>\n",
       "      <td>15000.0</td>\n",
       "      <td>10000.0</td>\n",
       "      <td>10000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>41</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>14232.0</td>\n",
       "      <td>14676.0</td>\n",
       "      <td>1976.0</td>\n",
       "      <td>2976.0</td>\n",
       "      <td>1518.0</td>\n",
       "      <td>1556.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>300.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \\\n",
       "0    10000.0    2          2         1   44      0      0      0     -2   \n",
       "1    90000.0    2          1         2   33      0      0      0      0   \n",
       "2   140000.0    1          3         2   29      0      0      0      0   \n",
       "3   340000.0    2          1         2   37     -1      0     -1      0   \n",
       "4    60000.0    2          2         2   41      0      0      0      0   \n",
       "\n",
       "   PAY_5    ...     BILL_AMT3  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  \\\n",
       "0     -2    ...           0.0        0.0        0.0        0.0    1275.0   \n",
       "1      0    ...       59753.0    61755.0    63629.0    54423.0    2100.0   \n",
       "2      0    ...      142140.0   142111.0   194934.0    95484.0    5600.0   \n",
       "3      0    ...       16308.0    21065.0    20581.0    15936.0    8641.0   \n",
       "4      0    ...       14232.0    14676.0     1976.0     2976.0    1518.0   \n",
       "\n",
       "   PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  \n",
       "0       0.0       0.0       0.0       0.0       0.0  \n",
       "1    4000.0    3000.0    3000.0    2323.0    2000.0  \n",
       "2    6150.0    5900.0    4000.0    4000.0   50000.0  \n",
       "3   16400.0   15000.0   15000.0   10000.0   10000.0  \n",
       "4    1556.0    1000.0     300.0    1000.0       0.0  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1: billing_model\n",
    "\n",
    "The most predictive aspect of the data set is the customer's billing history. Build a simple model that predicts whether a customer will default based only on the billing data. The model should implement a fit method that receives a DataFrame with the fields `'LIMIT_BAL'`, `'PAY_x'` (x = 0--6, except for 1), `'BILL_AMTx'` (x = 1--6), and `'PAY_AMTx'` (x = 1--6) as its feature matrix and the target labels. The model should also implement a predict method that receives the same features and returns _predicted label probabilities_. In most `sklearn` estimators this will be called `predict_proba`. It is important that you return predicted probabilities for compatibility with the ROC AUC metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaler1=StandardScaler()\n",
    "data=pd.DataFrame(scaler1.fit_transform(data), columns=data.columns)\n",
    "\n",
    "scaler2=StandardScaler()\n",
    "test=pd.DataFrame(scaler2.fit_transform(test), columns=test.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "df=data.drop(['SEX', 'EDUCATION', 'MARRIAGE','AGE'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "billing_model=LogisticRegression()\n",
    "billing_model.fit(df, target)\n",
    "\n",
    "test_df=test.drop(['SEX', 'EDUCATION', 'MARRIAGE','AGE'], axis=1)\n",
    "predictions=billing_model.predict_proba(test_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bill_predictions():\n",
    "    return predictions[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "billing_data = ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bill_predictions():\n",
    "    return np.random.random(3000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  0.898219113148\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "grader.score('ml__billing_model', bill_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2: balanced_billing\n",
    "\n",
    "Default is rare, but we want to be sure to catch likely defaults before they happen; that is, we want high recall. What is the recall of your model? It may suffer due to class imbalance. Investigate the recall of your model and try to optimize it by creating a strategy to deal with class imbalance in the data set.\n",
    "\n",
    "When you've updated your model, submit its `predict_proba` method to the grader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from sklearn.utils import shuffle\n",
    "\n",
    "scaler=StandardScaler()\n",
    "\n",
    "df_trans=scaler.fit_transform(df)\n",
    "\n",
    "df=pd.DataFrame(df_trans, columns=df.columns)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'penalty': 'l2', 'C': 0.1}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.6998313659359191"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(*(df, target), test_size=0.3)\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "model = LogisticRegression()\n",
    "\n",
    "gs = GridSearchCV(model,\n",
    "                  {'penalty': ['l1', 'l2'],\n",
    "                  'C': [.001, .01, .1]},\n",
    "                      cv=5,\n",
    "                      n_jobs=2,\n",
    "                      scoring='neg_mean_squared_error')\n",
    "\n",
    "gs.fit(X_train, y_train)\n",
    "\n",
    "print gs.best_params_\n",
    "\n",
    "gs.best_estimator_\n",
    "\n",
    "model=gs.best_estimator_\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "predictions=model.predict(X_test)\n",
    "\n",
    "from sklearn import metrics\n",
    "\n",
    "metrics.recall_score(predictions, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bal_bill_predictions():\n",
    "    return model.predict_proba(test_df)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bal_bill_predictions():\n",
    "    return np.random.random(3000)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  0.902307318638\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "\n",
    "grader.score('ml__balanced_billing_model', bal_bill_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 3: demo_model\n",
    "\n",
    "Billing data would not be available for prospective customers, but we may want to predict their risk of default if given a line of credit. Construct a model that only takes into account the fields `'SEX'`', `'EDUCATION'`, `'MARRIAGE'`, `'AGE'`, and `'LIMIT_BAL'` (which the creditor controls/knows in advance) to predict default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LIMIT_BAL</th>\n",
       "      <th>AGE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>50000.0</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>80000.0</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>200000.0</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>280000.0</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>150000.0</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   LIMIT_BAL  AGE\n",
       "0    50000.0   34\n",
       "1    80000.0   43\n",
       "2   200000.0   36\n",
       "3   280000.0   50\n",
       "4   150000.0   51"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df= data[['LIMIT_BAL', 'AGE']]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(27000, 2)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 1])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['SEX'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 1, 3, 5, 4, 6, 0])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['EDUCATION'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 0])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['MARRIAGE'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SEX_1</th>\n",
       "      <th>SEX_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   SEX_1  SEX_2\n",
       "0      0      1\n",
       "1      1      0\n",
       "2      1      0\n",
       "3      0      1\n",
       "4      0      1"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sex = pd.get_dummies(data['SEX'], prefix = 'SEX')\n",
    "sex.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(27000, 2)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sex.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EDUCATION_0</th>\n",
       "      <th>EDUCATION_1</th>\n",
       "      <th>EDUCATION_2</th>\n",
       "      <th>EDUCATION_3</th>\n",
       "      <th>EDUCATION_4</th>\n",
       "      <th>EDUCATION_5</th>\n",
       "      <th>EDUCATION_6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EDUCATION_0  EDUCATION_1  EDUCATION_2  EDUCATION_3  EDUCATION_4  \\\n",
       "0            0            0            1            0            0   \n",
       "1            0            0            1            0            0   \n",
       "2            0            1            0            0            0   \n",
       "3            0            0            1            0            0   \n",
       "4            0            0            1            0            0   \n",
       "\n",
       "   EDUCATION_5  EDUCATION_6  \n",
       "0            0            0  \n",
       "1            0            0  \n",
       "2            0            0  \n",
       "3            0            0  \n",
       "4            0            0  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ed = pd.get_dummies(data['EDUCATION'], prefix = 'EDUCATION')\n",
    "ed.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>MARRIAGE_0</th>\n",
       "      <th>MARRIAGE_1</th>\n",
       "      <th>MARRIAGE_2</th>\n",
       "      <th>MARRIAGE_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   MARRIAGE_0  MARRIAGE_1  MARRIAGE_2  MARRIAGE_3\n",
       "0           0           1           0           0\n",
       "1           0           1           0           0\n",
       "2           0           1           0           0\n",
       "3           0           0           1           0\n",
       "4           0           1           0           0"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mar = pd.get_dummies(data['MARRIAGE'], prefix = 'MARRIAGE' )\n",
    "mar.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LIMIT_BAL</th>\n",
       "      <th>AGE</th>\n",
       "      <th>SEX_1</th>\n",
       "      <th>SEX_2</th>\n",
       "      <th>EDUCATION_0</th>\n",
       "      <th>EDUCATION_1</th>\n",
       "      <th>EDUCATION_2</th>\n",
       "      <th>EDUCATION_3</th>\n",
       "      <th>EDUCATION_4</th>\n",
       "      <th>EDUCATION_5</th>\n",
       "      <th>EDUCATION_6</th>\n",
       "      <th>MARRIAGE_0</th>\n",
       "      <th>MARRIAGE_1</th>\n",
       "      <th>MARRIAGE_2</th>\n",
       "      <th>MARRIAGE_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>50000.0</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>80000.0</td>\n",
       "      <td>43</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>200000.0</td>\n",
       "      <td>36</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>280000.0</td>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>150000.0</td>\n",
       "      <td>51</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   LIMIT_BAL  AGE  SEX_1  SEX_2  EDUCATION_0  EDUCATION_1  EDUCATION_2  \\\n",
       "0    50000.0   34      0      1            0            0            1   \n",
       "1    80000.0   43      1      0            0            0            1   \n",
       "2   200000.0   36      1      0            0            1            0   \n",
       "3   280000.0   50      0      1            0            0            1   \n",
       "4   150000.0   51      0      1            0            0            1   \n",
       "\n",
       "   EDUCATION_3  EDUCATION_4  EDUCATION_5  EDUCATION_6  MARRIAGE_0  MARRIAGE_1  \\\n",
       "0            0            0            0            0           0           1   \n",
       "1            0            0            0            0           0           1   \n",
       "2            0            0            0            0           0           1   \n",
       "3            0            0            0            0           0           0   \n",
       "4            0            0            0            0           0           1   \n",
       "\n",
       "   MARRIAGE_2  MARRIAGE_3  \n",
       "0           0           0  \n",
       "1           0           0  \n",
       "2           0           0  \n",
       "3           1           0  \n",
       "4           0           0  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_2 = pd.concat([df, sex, ed, mar], axis=1)\n",
    "df_2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'penalty': 'l1', 'C': 1e-08}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1e-06, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
       "          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,\n",
       "          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler=StandardScaler()\n",
    "df_2_trans=scaler.fit_transform(df_2)\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "model = LogisticRegression()\n",
    "\n",
    "gs = GridSearchCV(model,\n",
    "                  {'penalty': ['l1', 'l2'],\n",
    "                  'C': [.00000001, .000001, .00001, .0001, .001, .01]},\n",
    "                  cv=5,\n",
    "                  n_jobs=2,\n",
    "                  scoring='neg_mean_squared_error')\n",
    "gs.fit(df_2_trans, target)\n",
    "print gs.best_params_\n",
    "\n",
    "model=LogisticRegression(penalty='l1', C=.000001)\n",
    "model.fit(df_2_trans, target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df = test[['LIMIT_BAL', 'AGE']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_sex = pd.get_dummies(test['SEX'], prefix = 'SEX')\n",
    "test_ed = pd.get_dummies(test['EDUCATION'], prefix = 'EDUCATION')\n",
    "test_mar = pd.get_dummies(test['MARRIAGE'], prefix = 'MARRIAGE')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LIMIT_BAL</th>\n",
       "      <th>AGE</th>\n",
       "      <th>SEX_1</th>\n",
       "      <th>SEX_2</th>\n",
       "      <th>EDUCATION_0</th>\n",
       "      <th>EDUCATION_1</th>\n",
       "      <th>EDUCATION_2</th>\n",
       "      <th>EDUCATION_3</th>\n",
       "      <th>EDUCATION_4</th>\n",
       "      <th>EDUCATION_5</th>\n",
       "      <th>EDUCATION_6</th>\n",
       "      <th>MARRIAGE_0</th>\n",
       "      <th>MARRIAGE_1</th>\n",
       "      <th>MARRIAGE_2</th>\n",
       "      <th>MARRIAGE_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10000.0</td>\n",
       "      <td>44</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>90000.0</td>\n",
       "      <td>33</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>140000.0</td>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>340000.0</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60000.0</td>\n",
       "      <td>41</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   LIMIT_BAL  AGE  SEX_1  SEX_2  EDUCATION_0  EDUCATION_1  EDUCATION_2  \\\n",
       "0    10000.0   44      0      1            0            0            1   \n",
       "1    90000.0   33      0      1            0            1            0   \n",
       "2   140000.0   29      1      0            0            0            0   \n",
       "3   340000.0   37      0      1            0            1            0   \n",
       "4    60000.0   41      0      1            0            0            1   \n",
       "\n",
       "   EDUCATION_3  EDUCATION_4  EDUCATION_5  EDUCATION_6  MARRIAGE_0  MARRIAGE_1  \\\n",
       "0            0            0            0            0           0           1   \n",
       "1            0            0            0            0           0           0   \n",
       "2            1            0            0            0           0           0   \n",
       "3            0            0            0            0           0           0   \n",
       "4            0            0            0            0           0           0   \n",
       "\n",
       "   MARRIAGE_2  MARRIAGE_3  \n",
       "0           0           0  \n",
       "1           1           0  \n",
       "2           1           0  \n",
       "3           1           0  \n",
       "4           1           0  "
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_df_2 = pd.concat([test_df, test_sex, test_ed, test_mar], axis=1)\n",
    "test_df_2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.5,  0.5],\n",
       "       [ 0.5,  0.5],\n",
       "       [ 0.5,  0.5],\n",
       "       ..., \n",
       "       [ 0.5,  0.5],\n",
       "       [ 0.5,  0.5],\n",
       "       [ 0.5,  0.5]])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict_proba(test_df_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "def demo_predictions():\n",
    "    return model.predict_proba(test_df_2)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def demo_predictions():\n",
    "    return np.random.random(3000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  2.12992424831\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "grader.score('ml__demo_model', demo_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 4: ensemble_model\n",
    "\n",
    "Let's combine the output of our two models in a simple ensemble. That is, take the predicted probabilities of your model based on billing data and your model based on demographic data as inputs for a final estimator that combines them (maybe a simple logistic regression, for instance).\n",
    "\n",
    "You will need to use pipelines and feature unions to accomplish this, because the grader will expect a model that accepts the full feature matrix as input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "df=data.drop(['SEX', 'EDUCATION', 'MARRIAGE','AGE'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import SGDClassifier\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(*(df, target), test_size=0.3)\n",
    "\n",
    "pipeline = Pipeline([\n",
    "    ('sgd_lr_1', SGDClassifier(loss='log')),('sgd_lr', SGDClassifier(loss='log')),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python2.7/site-packages/sklearn/utils/deprecation.py:70: DeprecationWarning: Function transform is deprecated; Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.\n",
      "  warnings.warn(msg, category=DeprecationWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Pipeline(steps=[('sgd_lr_1', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,\n",
       "       eta0=0.0, fit_intercept=True, l1_ratio=0.15,\n",
       "       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,\n",
       "       penalty='l2', power_t=0.5, random_state=None, shuffle=True,\n",
       "       verbose=0, warm...   penalty='l2', power_t=0.5, random_state=None, shuffle=True,\n",
       "       verbose=0, warm_start=False))])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python2.7/site-packages/sklearn/utils/deprecation.py:70: DeprecationWarning: Function transform is deprecated; Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.\n",
      "  warnings.warn(msg, category=DeprecationWarning)\n",
      "/opt/conda/lib/python2.7/site-packages/sklearn/linear_model/base.py:352: RuntimeWarning: overflow encountered in exp\n",
      "  np.exp(prob, prob)\n"
     ]
    }
   ],
   "source": [
    "\n",
    "Pipeline(steps=[('sgd_lr_1', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,\n",
    "       eta0=0.0, fit_intercept=True, l1_ratio=0.15,\n",
    "       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,\n",
    "       penalty='l2', power_t=0.5, random_state=None, shuffle=True,\n",
    "       verbose=0, warm_start=False))])\n",
    "\n",
    "yhat=pipeline.predict_proba(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ensemble_predictions():\n",
    "    return yhat[:,1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  0.401495173738\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "def ensemble_predictions():\n",
    "    return np.random.random(3000)\n",
    "\n",
    "grader.score('ml__ensemble_model', ensemble_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Copyright &copy; 2017 The Data Incubator.  All rights reserved.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  },
  "nbclean": true
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
